Digital Humanities is often concerned with creating models of text: a general name for a kind of representation of text which makes it in some way easier to interpret. TEI-encoded text is an example of a model: we take the raw material of a text document and add elements to it to make it easier to work with and analyse. Models are often further abstracted from the original text. One way we can represent text in a way that a machine can interpret is with a word vector. A word vector is simply a numerical representation of a word within a corpus (a body of text, often a series of documents), usually consisting of a series of numbers in a specified sequence. This type of representation is used for a variety of Natural Language Processing tasks - for instance measuring the similarity between two documents. This post uses a couple of R packages and a method for creating word vectors with a neural net, called GloVe, to produce a series of vectors which give useful clues as to the semantic links between words in a corpus. The method is then used to analyse the printed summaries of the English State Papers, from State Papers Online, and show how they can be used to understand how the association between words and concepts changed over the course of the seventeenth century.
Imagine you have two documents in a corpus. One of them is an article about pets, and the other is a piece of fiction about a team of crime fighting animal superheroes. We’ll call them document A and document B. One way to represent the words within these documents as a vector would be to use the counts of each word per document.
To do this, you could give each word a set of coordinates, \(x\) and \(y\), where \(x\) is a count of how many times the word appears in document A and \(y\) the number of times it appears in document B.
The first step is to make a dataframe with the relevant counts:
library(ggrepel)
library(tidyverse)
word_vectors = tibble(word = c('crufts', 'feed', 'cat', 'dog', 'mouse', 'rabbit', 'cape', 'hero' ),
x = c(10, 8, 6, 5, 6, 5, 2, 1),
y = c(0, 1, 3, 5, 8, 8, 10, 9))
word_vectors
This data can be represented as a two-dimensional plot where each word is placed on the x and y axes based on their x and y values, like this:
ggplot() +
geom_point(data = word_vectors, aes(x, y), size =4, alpha = .7) +
geom_text_repel(data = word_vectors, aes(x, y, label = word)) +
theme_bw() +
labs(title = "Words Represented in Two-dimension Space") +
theme(title = element_text(face = 'bold')) +
scale_x_continuous(breaks = 1:10) +
scale_y_continuous(breaks = 1:10)
Each word is represented as a vector of length 2: ‘rabbit’ is a vector containing two numbers: {5,8}, for example. Using very basic maths we can calculate the euclidean distance between any pair of words. More or less the only thing I can remember from secondary school math is how to calculate the distance between two points on a graph, using the following formula:
\[ \sqrt {\left( {x_1 - x_2 } \right)^2 + \left( {y_1 - y_2 } \right)^2 } \]
Where \(x\) is the first point and \(y\) the second. This can easily be turned into a function in R, which takes a set of coordinates (the arguments x1 and x2) and returns the euclidean distance:
euc.dist <- function(x1, x2) sqrt(sum((pointA - pointB) ^ 2))
To get the distance between crufts and mouse, set pointA as the \(x\) and \(y\) ccoordinates for the first entry in the dataframe of coordinates we created above, and pointB the coordinates for the fifth entry:
pointA = c(word_vectors$x[1], word_vectors$y[1])
pointB = c(word_vectors$x[5], word_vectors$y[5])
euc.dist(pointA, pointB)
## [1] 8.944272
Representing a pair of words as vectors and measuring the distance between them is commonly used to suggest a semantic link between the two. For instance, the distance between hero and cape in this corpus is small, because they have similar properties: they both occur mostly in the document about superheroes and rarely in the document about pets.
pointA = c(word_vectors$x[word_vectors$word == 'hero'], word_vectors$y[word_vectors$word == 'hero'])
pointB = c(word_vectors$x[word_vectors$word == 'cape'], word_vectors$y[word_vectors$word == 'cape'])
euc.dist(pointA, pointB)
## [1] 1.414214
This suggests that the model has ‘learned’ that in this corpus, hero and cape are semantically more closely linked than other pairs in the dataset. The difference between cape and feed, on the other hand, is large, because one appears often in the superheroes article and rarely in the other, and vice versa.
pointA = c(word_vectors$x[word_vectors$word == 'cape'], word_vectors$y[word_vectors$word == 'cape'])
pointB = c(word_vectors$x[word_vectors$word == 'feed'], word_vectors$y[word_vectors$word == 'feed'])
euc.dist(pointA, pointB)
## [1] 10.81665
These vectors, each consisting of two numbers, can be thought of as two-dimensional vectors: a type which can be represented on a 2D scatterplot as \(x\) and \(y\). It’s very easy to add a third dimension, \(z\):
word_vectors_3d = tibble(word = c('crufts', 'feed', 'cat', 'dog', 'mouse', 'rabbit', 'cape', 'hero' ),
x = c(10, 8, 6, 5, 6, 5, 2, 1),
y = c(0, 1, 3, 5, 8, 8, 10, 9),
z = c(1,3,5,2,7,8,4,3))
Just like the plot above, we can plot the words, this time in in three dimensions, using Plotly:
library(plotly)
plot_ly(data = word_vectors_3d, x = ~x, y = ~y,z = ~z, text = ~word) %>% add_markers()
You can start to understand how the words now cluster together in the 3D plot: rabbit and mouse are clustered together, but now in the third dimension they are further away from dog. We can use the same formula as above to calculate these distances, just by adding the z coordinates to the pointA and pointB vectors:
pointA = c(word_vectors$x[word_vectors$word == 'dog'], word_vectors$y[word_vectors$word == 'dog'], word_vectors$z[word_vectors$word == 'dog'])
## Warning: Unknown or uninitialised column: `z`.
pointB = c(word_vectors$x[word_vectors$word == 'mouse'], word_vectors$y[word_vectors$word == 'mouse'], word_vectors$z[word_vectors$word == 'mouse'])
## Warning: Unknown or uninitialised column: `z`.
euc.dist(pointA, pointB)
## [1] 3.162278
The nice thing about the method is that while my brain starts to hurt when I think about more than three dimensions, the maths behind it doesn’t care: you can just keep plugging in longer and longer vectors and it’ll continue to calculate the distances as long as they are the same length. This means you can use this same formula not just when you have x and y coordinates, but also z, a, b, c, d, and so on for as long as you like. This is often called ‘representing words in multi-dimensional euclidean space’, or something similar which sounds great on grant applications but it’s really just doing some plotting and measuring distances. Which means that if you represent all the words in a corpus as a long vector (series of coordinates), you can quickly measure the distance between any two.
In a large corpus with a properly-constructed vector representation, the semantic relationships between the words start to make a lot of sense. What’s more, because of vector math, you can add, subtract, divide and multiply the words together to get new vectors, and then find the closest to that. Here, we create a new vector, which is pointA - pointB (dog - mouse). Then loop through each vector and calculate the distance, and display in a new dataframe:
pointC = pointA - pointB
df_for_results = tibble()
for(i in 1:8){
pointA = c(word_vectors$x[i], word_vectors$y[i], word_vectors$z[i])
u = tibble(dist = euc.dist(pointC, pointA), word = word_vectors$word[i])
df_for_results = rbind(df_for_results, u)
}
## Warning: Unknown or uninitialised column: `z`.
## Warning: Unknown or uninitialised column: `z`.
## Warning: Unknown or uninitialised column: `z`.
## Warning: Unknown or uninitialised column: `z`.
## Warning: Unknown or uninitialised column: `z`.
## Warning: Unknown or uninitialised column: `z`.
## Warning: Unknown or uninitialised column: `z`.
## Warning: Unknown or uninitialised column: `z`.
df_for_results %>% arrange(dist)
The closest to dog - mouse is hero, with this vector representation.
These vectors are also known as word embeddings. Real algorithms base the vectors on more sophisticated metrics than that I used above. Some, such as GloVe record co-occurrence probabilities (the likelihood of every pair of words in a corpus to co-occur within a set ‘window’ of words either side), using a neural network, and pre-trained over enormous corpora of text. The resulting vectors are often used to represent the relationships between modern meanings of words, to track semantic changes over time, or to understand the history of concepts, though it’s worth pointing out they’re only as representative as the corpus used (many use sources such as Wikipedia, or Reddit, mostly produced by white men and so there’s a danger of biases towards those groups).
Word embeddings are often critiqued as reflecting or propogating bias (I highly recommend Kaspar Beelen’s post and tools to understand more about this) of their source texts. The source used here is a corpus consisting of the printed summaries of the Calendars of State Papers, which I’ve described in detail here. As such it is likely highly biased, but if the purpose of an analysis is historical, for example to understand how a concept was represented at a given time, by a specific group, in a particular body of text, the biases captured by word embeddings can be seen as a research strength rather than a weakness. The data is in no way representative of early modern text more generally, and, what’s more, the summaries were written in the 19th century and so will reflect what editors at the time thought was important. In these two ways, the corpus will reproduce a very particular wordview of a very specific group, at a very specific time. Because of this, can use the embeddings to get an idea of how certain words or ideas were semantically linked, specifically in the corpus of calendar abstracts. The data will not show us how early modern concepts were related, but it might show conceptual changes in words within the information apparatus of the state.
The following instructions are adapted from the project vignette and this tutorial. First, tokenise all the abstract text and remove very common words called stop words:
library(text2vec)
library(tidytext)
library(textstem)
## Loading required package: koRpus.lang.en
## Loading required package: koRpus
## Loading required package: sylly
## For information on available language packages for 'koRpus', run
##
## available.koRpus.lang()
##
## and see ?install.koRpus.lang()
##
## Attaching package: 'koRpus'
## The following object is masked from 'package:readr':
##
## tokenize
data("stop_words")
Next, load and pre-process the abstract text:
spo_raw = read_delim('/Users/yannryanpersonal/Documents/blog_posts/MOST RECENT DATA/fromto_all_place_mapped_stuart_sorted', delim = '\t', col_names = F )
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## X1 = col_double(),
## X2 = col_double(),
## X3 = col_character(),
## X4 = col_character(),
## X5 = col_character(),
## X6 = col_character(),
## X7 = col_character(),
## X8 = col_character()
## )
spo_mapped_people = read_delim('/Users/yannryanpersonal/Documents/blog_posts/MOST RECENT DATA/people_docs_stuart_200421', delim = '\t', col_names = F)
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## X1 = col_double(),
## X2 = col_character(),
## X3 = col_character(),
## X4 = col_character(),
## X5 = col_character(),
## X6 = col_character(),
## X7 = col_logical()
## )
load('/Users/yannryanpersonal/Documents/blog_posts/g')
g = g %>% group_by(path) %>% summarise(value = paste0(value, collapse = "<br>"))
## `summarise()` ungrouping output (override with `.groups` argument)
spo_raw = spo_raw %>%
mutate(X7 = str_replace(X7, "spo", "SPO")) %>%
separate(X7, into = c('Y1', 'Y2', 'Y3'), sep = '/') %>%
mutate(fullpath = paste0("/Users/Yann/Documents/non-Github/spo_xml/", Y1, '/XML/', Y2,"/", Y3)) %>% mutate(uniquecode = paste0("Z", 1:nrow(spo_raw), "Z"))
withtext = left_join(spo_raw, g, by = c('fullpath' = 'path')) %>%
left_join(spo_mapped_people %>% dplyr::select(X1, from_name = X2), by = c('X1' = 'X1'))%>%
left_join(spo_mapped_people %>% dplyr::select(X1, to_name = X2), by = c('X2' = 'X1'))
Tokenize the text using the Tidytext function unnest_tokens(), remove stop words, lemmatize the text (reduce the words to their stem) using textstem, and remove numbers. This creates a new dataset, with one row per word, plus.
words = withtext %>%
ungroup() %>%
select(document = X5, value, date = X3) %>%
unnest_tokens(word, value) %>% anti_join(stop_words)%>%
mutate(word = lemmatize_words(word)) %>%
filter(!str_detect(word, "[0-9]{1,}")) %>% mutate(word = str_remove(word, "\\'s"))
## Joining, by = "word"
Create a ‘vocabulary’, which is just a list of each word found in the dataset and the times they occur, and ‘prune’ it to only words which occur at least five times.
words_ls = list(words$word)
it = itoken(words_ls, progressbar = FALSE)
vocab = create_vocabulary(it)
vocab = prune_vocabulary(vocab, term_count_min = 5)
With the vocabulary, construct a ‘term co-occurence matrix’: this is a matrix of rows and columns, counting all the times each word co-occurs with every other word, within a window which can be set with the argument skip_grams_window =. 5 seems to give me good results - I think because many of the documents are so short.
vectorizer = vocab_vectorizer(vocab)
# use window of 10 for context words
tcm = create_tcm(it, vectorizer, skip_grams_window = 5)
Now use the GloVe algorithm to train the model and produce the vectors, with a set number of iterations: here we’ve used 20, which seems to give good results. rank here is the number of dimensions we want. x_max is the maximum number of co-occurrences the model will consider in total - giving it a relatively low maximum means that the whole thing won’t be skewed towards a small numbre of words that occur together hundreds of times. rank sets the number of dimensions in the result. The algorithm can be quite slow, but as it’s a relatively small dataset (in comparison to something like the entire English wikipedia), it shouldn’t take too long to run - a couple of minutes for 20 iterations.
glove = GlobalVectors$new(rank = 100, x_max = 100)
wv_main = glove$fit_transform(tcm, n_iter = 20, convergence_tol = 0.00001)
## INFO [07:04:25.341] epoch 1, loss 0.0539
## INFO [07:04:38.503] epoch 2, loss 0.0318
## INFO [07:04:51.829] epoch 3, loss 0.0261
## INFO [07:05:05.224] epoch 4, loss 0.0234
## INFO [07:05:18.537] epoch 5, loss 0.0216
## INFO [07:05:31.924] epoch 6, loss 0.0204
## INFO [07:05:45.319] epoch 7, loss 0.0195
## INFO [07:05:58.691] epoch 8, loss 0.0187
## INFO [07:06:12.061] epoch 9, loss 0.0181
## INFO [07:06:25.444] epoch 10, loss 0.0176
## INFO [07:06:39.007] epoch 11, loss 0.0172
## INFO [07:06:52.559] epoch 12, loss 0.0168
## INFO [07:07:05.968] epoch 13, loss 0.0165
## INFO [07:07:19.290] epoch 14, loss 0.0162
## INFO [07:07:32.628] epoch 15, loss 0.0159
## INFO [07:07:46.021] epoch 16, loss 0.0157
## INFO [07:07:59.371] epoch 17, loss 0.0155
## INFO [07:08:12.716] epoch 18, loss 0.0153
## INFO [07:08:26.073] epoch 19, loss 0.0151
## INFO [07:08:39.417] epoch 20, loss 0.0149
GloVe results in two sets of word vectors, the main and the context. The authors of the GloVe package suggest that combining both results in higher-quality embeddings:
wv_context = glove$components
# Either word-vectors matrices could work, but the developers of the technique
# suggest the sum/mean may work better
word_vectors = wv_main + t(wv_context)
Now that’s done, it’d be nice to visualise the results as a whole. This isn’t actually necessary: as I mentioned earlier, the computer doesn’t care how many dimensions you give it to work out the distances between words. However, in order to visualise the results as whole, we can reduce the 100 dimensions to two or three and plot the results. We can do this with an algorithm called UMAP
There are a number of parameters which can be set - most important is n_components which should be set to two or three so that the results can be plotted.
library(umap)
glove_umap <- umap(word_vectors, n_components = 3, metric = "cosine", n_neighbors = 25, min_dist = 0.01, spread=2)
df_glove_umap <- as.data.frame(glove_umap$layout, stringsAsFactors = FALSE)
# Add the labels of the words to the dataframe
df_glove_umap$word <- rownames(df_glove_umap)
colnames(df_glove_umap) <- c("UMAP1", "UMAP2", "UMAP3", "word")
df_glove_umap$technique <- 'GloVe'
cat(paste0('\n', 'Our GloVe embedding reduced to 2 dimensions:', '\n'))
##
## Our GloVe embedding reduced to 2 dimensions:
str(df_glove_umap)
## 'data.frame': 22253 obs. of 5 variables:
## $ UMAP1 : num -2.758 -0.926 -1.513 -0.749 0.164 ...
## $ UMAP2 : num -0.0562 1.2093 1.8558 0.2308 0.6367 ...
## $ UMAP3 : num -2.164 0.194 -3.106 2.76 0.53 ...
## $ word : chr "aalst" "aarsele" "abdias" "abernethy" ...
## $ technique: chr "GloVe" "GloVe" "GloVe" "GloVe" ...
Next, use Plotly as above to visualise the resulting three dimensions:
plot_ly(data = df_glove_umap, x = ~UMAP1, y = ~UMAP2, z = ~UMAP3, text = ~word, alpha = .2, size = .1) %>% add_markers()
When it’s finished, write a small function which calculates and displays the closest words in distance to a given word. Instead of using the euclidean distance formula above, we calculate the cosine similarity, which measures the angular distance between the words (this is better because it corrects for one word appearing many times and another appearing very infrequently).
ten_closest_words = function(word){
word_result = word_vectors[word, , drop = FALSE]
cos_sim = sim2(x = word_vectors, y = word_result, method = "cosine", norm = "l2")
head(sort(cos_sim[,1], decreasing = TRUE), 30)
}
The function takes a single word as an argument and returns the twenty closest word vectors, by cosine distance. What are the closest in useage to ‘king’?
ten_closest_words('king')
## king majesty england queen lord leave desire
## 1.0000000 0.8638741 0.7727203 0.7408834 0.7375371 0.7350961 0.7342174
## late hope understand prince promise command hear
## 0.7268651 0.7255197 0.7242622 0.7237931 0.7223181 0.7198124 0.7177404
## please favour time receive duke grant return
## 0.7168768 0.7156008 0.7134387 0.7122003 0.7120294 0.7104037 0.7092559
## letter pray intend tell br bring inform
## 0.7069623 0.7064916 0.7001860 0.6952622 0.6943447 0.6849711 0.6843482
## answer service
## 0.6826238 0.6813778
Unsurprisingly, a word that is often interchangeable with King, Majesty, is the closest, followed by ‘Queen’ - also obviously interchangeable with King, depending on the circumstances.
Word embeddings are often used to understand different and changing gender representations. How are gendered words represented in the State Papers abstracts? First of all, wife:
ten_closest_words('wife')
## wife child husband sister daughter lady brother marry
## 1.0000000 0.8251752 0.8014678 0.7684266 0.7494467 0.7361869 0.7356229 0.7318627
## mother son father family widow woman servant uncle
## 0.7223632 0.7043287 0.6813588 0.6689104 0.6684048 0.6494668 0.6306394 0.6206129
## friend live writer leave die life dead poor
## 0.6180055 0.6039532 0.5888265 0.5827822 0.5773624 0.5757947 0.5746942 0.5744416
## pray estate niece countess remember health
## 0.5529604 0.5508137 0.5408341 0.5368639 0.5334508 0.5332996
Unsurprisingly wife is most similar to other words relating to family. What about husband?
ten_closest_words('husband')
## husband wife child widow father woman
## 1.0000000 0.8014678 0.7618610 0.7226149 0.6372039 0.6152997
## mother servant imprisonment son daughter marry
## 0.6110061 0.6067118 0.6057760 0.6030784 0.6018447 0.6010796
## debt brother petitioner sister lady prisoner
## 0.5966612 0.5907385 0.5898349 0.5787378 0.5779477 0.5770660
## release decease die family access late
## 0.5753734 0.5749962 0.5731349 0.5682910 0.5578974 0.5538662
## death estate petition life liberty dead
## 0.5522774 0.5430116 0.5421756 0.5248609 0.5228890 0.5221336
Husband is mostly similar but with some interesting different associations: ‘widow’, ‘die’, ‘petition’, ‘debt’, and ‘prisoner’, reflecting the fact that there is a large group of petitions in the State Papers written by women looking for pardons or clemency for their husbands, particularly following the Monmouth Rebellion in 1683.
Looking at the closest words to place names gives some interesting associations. Amsterdam is associated with terms related to shipping and trade:
ten_closest_words('amsterdam')
## amsterdam rotterdam lade bordeaux merchant bind
## 1.0000000 0.7843269 0.6418232 0.5993143 0.5912159 0.5540331
## vessel holland french flush london hamburg
## 0.5517031 0.5415674 0.5298537 0.5281513 0.5192847 0.5168510
## dutch dutchman richly prize merchantmen texel
## 0.5163422 0.5129344 0.5121920 0.5098451 0.5085725 0.5051106
## nantes sugar malo salt hoy sail
## 0.5013724 0.4991486 0.4950068 0.4943924 0.4911585 0.4891844
## arrive bilboa ship ostend english privateer
## 0.4885341 0.4851528 0.4845449 0.4842282 0.4824040 0.4792499
Whereas Rome is very much associated with religion and ecclesiastical politics:
ten_closest_words('rome')
## rome pope friar jesuit spain
## 1.0000000 0.6710667 0.5532513 0.5459079 0.5364638
## paris nuncio courier ambassador venice
## 0.5209065 0.5043323 0.4802523 0.4785569 0.4780924
## priest germany italy england tyrone
## 0.4729829 0.4671189 0.4633001 0.4626559 0.4480714
## cardinal church naples advertisement france
## 0.4451541 0.4429859 0.4420447 0.4350839 0.4344935
## bull depart catholic mass religion
## 0.4250987 0.4195093 0.4188933 0.4115858 0.4103830
## archdukes vienna shortly emperor brussels
## 0.4089576 0.4078097 0.4014825 0.4006026 0.3967409
As well as finding the most similar words, we can also perform arithmetic on the vectors. What is the closest word to book and news:
sum = word_vectors["book", , drop = F] +
word_vectors["news", , drop = F]
cos_sim_test = sim2(x = word_vectors, y = sum, method = "cosine", norm = "l2")
head(sort(cos_sim_test[,1], decreasing = T), 20)
## news book post account write send letter
## 0.8198345 0.8004811 0.6792542 0.6754230 0.6741882 0.6653849 0.6618579
## hear bring williamson enclose hope england print
## 0.6519728 0.6433154 0.6388822 0.6314165 0.6198477 0.6180945 0.6152369
## day return note hand week leave
## 0.6151938 0.6150748 0.6055430 0.6024733 0.6016956 0.6010381
It is also a way of finding analogies: so, for example, Paris - France + Germany should equal to ‘Berlin’, because Berlin is like the Paris of France. Is that what we get?
test = word_vectors["paris", , drop = F] -
word_vectors["france", , drop = F] +
word_vectors["germany", , drop = F]
#+
# shakes_word_vectors["letter", , drop = F]
cos_sim_test = sim2(x = word_vectors, y = test, method = "cosine", norm = "l2")
head(sort(cos_sim_test[,1], decreasing = T), 20)
## paris germany n.s madrid ps
## 0.6679108 0.6246328 0.5039738 0.4432661 0.4409361
## ratisbon bursar frankfort koningsmarck coe
## 0.4209300 0.4045458 0.3960551 0.3905299 0.3893929
## advertisement style hague brussels vienna
## 0.3889908 0.3813583 0.3787224 0.3697133 0.3639487
## shirt occurrence remarkable cypher ernley
## 0.3634198 0.3595532 0.3575269 0.3573064 0.3494931
After Germany and Paris, the most similar to Paris - France + Germany is Brussels: not the correct answer, but a close enough guess!
We can try other analogies: pen - letter + book should in theory give some word related to printing and book production such as print, or press, or maybe type (Think pen is to letter as X is to book).
test = word_vectors["pen", , drop = F] -
word_vectors["letter", , drop = F] +
word_vectors["book", , drop = F]
cos_sim_test = sim2(x = word_vectors, y = test, method = "cosine", norm = "l2")
head(sort(cos_sim_test[,1], decreasing = T), 20)
## pen ink book pamphlet manuscript wrangle unlicensed
## 0.5655214 0.5241191 0.5108703 0.4790678 0.4307206 0.4299565 0.4262404
## quire chop bible edgecombe ream vii barlowe
## 0.4190708 0.4188197 0.4159428 0.4116286 0.4105579 0.4077308 0.4049325
## cloak bundle schism liquor fool dredge
## 0.3986532 0.3899056 0.3884587 0.3878514 0.3875513 0.3874864
Not bad - printer is in the top 20! The closest is ink, plus some other book-production-related words like pamphlet. Though some of these words can also be associated with manuscript production, we could be generous and say that they are sort of to a book as a pen is to a letter!
We can also look for change in semantic meaning over time. First, divide the text into four separate sections, one for each reign:
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
james_i = withtext %>%
mutate(year = year(ymd(X4))) %>%
filter(year %in% 1603:1624) %>%
ungroup() %>%
select(document = X5, value, date = X3) %>%
unnest_tokens(word, value) %>%
anti_join(stop_words) %>%
mutate(word = lemmatize_words(word)) %>% filter(!str_detect(word, "[0-9]{1,}"))
## Warning: Problem with `mutate()` input `year`.
## ℹ 24 failed to parse.
## ℹ Input `year` is `year(ymd(X4))`.
## Warning: 24 failed to parse.
## Joining, by = "word"
charles_i = withtext %>%
mutate(year = year(ymd(X4))) %>%
filter(year %in% 1625:1648) %>%
ungroup() %>%
select(document = X5, value, date = X3) %>%
unnest_tokens(word, value) %>% anti_join(stop_words)%>%
mutate(word = lemmatize_words(word)) %>% filter(!str_detect(word, "[0-9]{1,}"))
## Warning: Problem with `mutate()` input `year`.
## ℹ 24 failed to parse.
## ℹ Input `year` is `year(ymd(X4))`.
## Warning: 24 failed to parse.
## Joining, by = "word"
commonwealth = withtext %>%
mutate(year = year(ymd(X4))) %>%
filter(year %in% 1649:1659) %>%
ungroup() %>%
select(document = X5, value, date = X3) %>%
unnest_tokens(word, value) %>% anti_join(stop_words)%>%
mutate(word = lemmatize_words(word)) %>% filter(!str_detect(word, "[0-9]{1,}"))
## Warning: Problem with `mutate()` input `year`.
## ℹ 24 failed to parse.
## ℹ Input `year` is `year(ymd(X4))`.
## Warning: 24 failed to parse.
## Joining, by = "word"
charles_ii = withtext %>%
mutate(year = year(ymd(X4))) %>%
filter(year %in% 1660:1684) %>%
ungroup() %>%
select(document = X5, value, date = X3) %>%
unnest_tokens(word, value) %>% anti_join(stop_words)%>%
mutate(word = lemmatize_words(word)) %>% filter(!str_detect(word, "[0-9]{1,}"))
## Warning: Problem with `mutate()` input `year`.
## ℹ 24 failed to parse.
## ℹ Input `year` is `year(ymd(X4))`.
## Warning: 24 failed to parse.
## Joining, by = "word"
james_ii_w_m_ann = withtext %>%
mutate(year = year(ymd(X4))) %>%
filter(year %in% 1685:1714) %>%
ungroup() %>%
select(document = X5, value, date = X3) %>%
unnest_tokens(word, value) %>% anti_join(stop_words) %>%
mutate(word = lemmatize_words(word)) %>% filter(!str_detect(word, "[0-9]{1,}"))
## Warning: Problem with `mutate()` input `year`.
## ℹ 24 failed to parse.
## ℹ Input `year` is `year(ymd(X4))`.
## Warning: 24 failed to parse.
## Joining, by = "word"
Now run the same scripts as above, on each of these sections:
james_i_words_ls = list(james_i$word)
it = itoken(james_i_words_ls, progressbar = FALSE)
james_i_vocab = create_vocabulary(it)
james_i_vocab = prune_vocabulary(james_i_vocab, term_count_min = 5)
vectorizer = vocab_vectorizer(james_i_vocab)
# use window of 10 for context words
james_i_tcm = create_tcm(it, vectorizer, skip_grams_window = 5)
james_i_glove = GlobalVectors$new(rank = 100, x_max = 100)
james_i_wv_main = james_i_glove$fit_transform(james_i_tcm, n_iter = 20, convergence_tol = 0.00001)
## INFO [07:13:51.511] epoch 1, loss 0.0491
## INFO [07:13:54.733] epoch 2, loss 0.0291
## INFO [07:13:57.951] epoch 3, loss 0.0231
## INFO [07:14:01.154] epoch 4, loss 0.0203
## INFO [07:14:04.353] epoch 5, loss 0.0185
## INFO [07:14:07.563] epoch 6, loss 0.0172
## INFO [07:14:10.778] epoch 7, loss 0.0162
## INFO [07:14:13.984] epoch 8, loss 0.0154
## INFO [07:14:17.178] epoch 9, loss 0.0147
## INFO [07:14:20.412] epoch 10, loss 0.0142
## INFO [07:14:23.623] epoch 11, loss 0.0137
## INFO [07:14:26.837] epoch 12, loss 0.0133
## INFO [07:14:30.065] epoch 13, loss 0.0129
## INFO [07:14:33.294] epoch 14, loss 0.0126
## INFO [07:14:36.509] epoch 15, loss 0.0123
## INFO [07:14:39.702] epoch 16, loss 0.0120
## INFO [07:14:42.909] epoch 17, loss 0.0118
## INFO [07:14:46.120] epoch 18, loss 0.0115
## INFO [07:14:49.328] epoch 19, loss 0.0113
## INFO [07:14:52.560] epoch 20, loss 0.0111
james_i_wv_context = james_i_glove$components
james_i_word_vectors = james_i_wv_main + t(james_i_wv_context)
charles_i_words_ls = list(charles_i$word)
it = itoken(charles_i_words_ls, progressbar = FALSE)
charles_i_vocab = create_vocabulary(it)
charles_i_vocab = prune_vocabulary(charles_i_vocab, term_count_min = 5)
vectorizer = vocab_vectorizer(charles_i_vocab)
# use window of 10 for context words
charles_i_tcm = create_tcm(it, vectorizer, skip_grams_window = 5)
charles_i_glove = GlobalVectors$new(rank = 100, x_max = 100)
charles_i_wv_main = charles_i_glove$fit_transform(charles_i_tcm, n_iter = 20, convergence_tol = 0.00001)
## INFO [07:15:01.562] epoch 1, loss 0.0537
## INFO [07:15:05.691] epoch 2, loss 0.0298
## INFO [07:15:09.677] epoch 3, loss 0.0234
## INFO [07:15:13.700] epoch 4, loss 0.0207
## INFO [07:15:17.690] epoch 5, loss 0.0189
## INFO [07:15:21.675] epoch 6, loss 0.0176
## INFO [07:15:25.658] epoch 7, loss 0.0166
## INFO [07:15:29.670] epoch 8, loss 0.0158
## INFO [07:15:33.694] epoch 9, loss 0.0152
## INFO [07:15:37.730] epoch 10, loss 0.0146
## INFO [07:15:41.763] epoch 11, loss 0.0141
## INFO [07:15:45.776] epoch 12, loss 0.0137
## INFO [07:15:49.800] epoch 13, loss 0.0134
## INFO [07:15:53.845] epoch 14, loss 0.0130
## INFO [07:15:57.842] epoch 15, loss 0.0127
## INFO [07:16:01.910] epoch 16, loss 0.0125
## INFO [07:16:05.911] epoch 17, loss 0.0122
## INFO [07:16:09.882] epoch 18, loss 0.0120
## INFO [07:16:13.854] epoch 19, loss 0.0118
## INFO [07:16:17.817] epoch 20, loss 0.0116
charles_i_wv_context = charles_i_glove$components
charles_i_word_vectors = charles_i_wv_main + t(charles_i_wv_context)
commonwealth_words_ls = list(commonwealth$word)
it = itoken(commonwealth_words_ls, progressbar = FALSE)
commonwealth_vocab = create_vocabulary(it)
commonwealth_vocab = prune_vocabulary(commonwealth_vocab, term_count_min = 5)
vectorizer = vocab_vectorizer(commonwealth_vocab)
# use window of 10 for context words
commonwealth_tcm = create_tcm(it, vectorizer, skip_grams_window = 5)
commonwealth_glove = GlobalVectors$new(rank = 100, x_max = 100)
commonwealth_wv_main = commonwealth_glove$fit_transform(commonwealth_tcm, n_iter = 20, convergence_tol = 0.00001)
## INFO [07:16:21.651] epoch 1, loss 0.0547
## INFO [07:16:23.346] epoch 2, loss 0.0315
## INFO [07:16:25.043] epoch 3, loss 0.0248
## INFO [07:16:26.747] epoch 4, loss 0.0215
## INFO [07:16:28.480] epoch 5, loss 0.0193
## INFO [07:16:30.188] epoch 6, loss 0.0178
## INFO [07:16:31.882] epoch 7, loss 0.0167
## INFO [07:16:33.589] epoch 8, loss 0.0157
## INFO [07:16:35.297] epoch 9, loss 0.0150
## INFO [07:16:37.015] epoch 10, loss 0.0143
## INFO [07:16:38.720] epoch 11, loss 0.0138
## INFO [07:16:40.415] epoch 12, loss 0.0133
## INFO [07:16:42.126] epoch 13, loss 0.0129
## INFO [07:16:43.818] epoch 14, loss 0.0125
## INFO [07:16:45.518] epoch 15, loss 0.0121
## INFO [07:16:47.224] epoch 16, loss 0.0118
## INFO [07:16:48.938] epoch 17, loss 0.0115
## INFO [07:16:50.671] epoch 18, loss 0.0113
## INFO [07:16:52.384] epoch 19, loss 0.0110
## INFO [07:16:54.108] epoch 20, loss 0.0108
commonwealth_wv_context = commonwealth_glove$components
# dim(shakes_wv_context)
# Either word-vectors matrices could work, but the developers of the technique
# suggest the sum/mean may work better
commonwealth_word_vectors = commonwealth_wv_main + t(commonwealth_wv_context)
charles_ii_words_ls = list(charles_ii$word)
it = itoken(charles_ii_words_ls, progressbar = FALSE)
charles_ii_vocab = create_vocabulary(it)
charles_ii_vocab = prune_vocabulary(charles_ii_vocab, term_count_min = 5)
vectorizer = vocab_vectorizer(charles_ii_vocab)
# use window of 10 for context words
charles_ii_tcm = create_tcm(it, vectorizer, skip_grams_window = 5)
charles_ii_glove = GlobalVectors$new(rank = 100, x_max = 100)
charles_ii_wv_main = charles_ii_glove$fit_transform(charles_ii_tcm, n_iter = 20, convergence_tol = 0.00001)
## INFO [07:17:07.576] epoch 1, loss 0.0503
## INFO [07:17:13.682] epoch 2, loss 0.0286
## INFO [07:17:19.731] epoch 3, loss 0.0231
## INFO [07:17:25.800] epoch 4, loss 0.0205
## INFO [07:17:31.863] epoch 5, loss 0.0188
## INFO [07:17:38.012] epoch 6, loss 0.0176
## INFO [07:17:44.127] epoch 7, loss 0.0167
## INFO [07:17:50.195] epoch 8, loss 0.0159
## INFO [07:17:56.319] epoch 9, loss 0.0153
## INFO [07:18:03.664] epoch 10, loss 0.0148
## INFO [07:18:15.506] epoch 11, loss 0.0144
## INFO [07:18:27.367] epoch 12, loss 0.0140
## INFO [07:18:36.653] epoch 13, loss 0.0137
## INFO [07:18:42.707] epoch 14, loss 0.0134
## INFO [07:18:48.747] epoch 15, loss 0.0131
## INFO [07:18:54.798] epoch 16, loss 0.0129
## INFO [07:19:04.634] epoch 17, loss 0.0127
## INFO [07:19:16.181] epoch 18, loss 0.0125
## INFO [07:19:26.842] epoch 19, loss 0.0123
## INFO [07:19:36.290] epoch 20, loss 0.0121
charles_ii_wv_context = charles_ii_glove$components
# dim(shakes_wv_context)
# Either word-vectors matrices could work, but the developers of the technique
# suggest the sum/mean may work better
charles_ii_word_vectors = charles_ii_wv_main + t(charles_ii_wv_context)
james_ii_w_m_ann_words_ls = list(james_ii_w_m_ann$word)
it = itoken(james_ii_w_m_ann_words_ls, progressbar = FALSE)
james_ii_w_m_ann_vocab = create_vocabulary(it)
james_ii_w_m_ann_vocab = prune_vocabulary(james_ii_w_m_ann_vocab, term_count_min = 5)
vectorizer = vocab_vectorizer(james_ii_w_m_ann_vocab)
# use window of 10 for context words
james_ii_w_m_ann_tcm = create_tcm(it, vectorizer, skip_grams_window = 5)
james_ii_w_m_ann_glove = GlobalVectors$new(rank = 100, x_max = 100)
james_ii_w_m_ann_wv_main = james_ii_w_m_ann_glove$fit_transform(james_ii_w_m_ann_tcm, n_iter = 20, convergence_tol = 0.00001)
## INFO [07:19:41.941] epoch 1, loss 0.0512
## INFO [07:19:44.506] epoch 2, loss 0.0294
## INFO [07:19:47.146] epoch 3, loss 0.0235
## INFO [07:19:49.760] epoch 4, loss 0.0206
## INFO [07:19:52.385] epoch 5, loss 0.0187
## INFO [07:19:55.017] epoch 6, loss 0.0174
## INFO [07:19:57.628] epoch 7, loss 0.0163
## INFO [07:20:00.242] epoch 8, loss 0.0155
## INFO [07:20:02.864] epoch 9, loss 0.0148
## INFO [07:20:07.515] epoch 10, loss 0.0142
## INFO [07:20:12.153] epoch 11, loss 0.0137
## INFO [07:20:16.802] epoch 12, loss 0.0133
## INFO [07:20:21.473] epoch 13, loss 0.0129
## INFO [07:20:26.118] epoch 14, loss 0.0126
## INFO [07:20:30.826] epoch 15, loss 0.0122
## INFO [07:20:35.493] epoch 16, loss 0.0120
## INFO [07:20:40.163] epoch 17, loss 0.0117
## INFO [07:20:44.831] epoch 18, loss 0.0115
## INFO [07:20:49.503] epoch 19, loss 0.0113
## INFO [07:20:54.145] epoch 20, loss 0.0111
james_ii_w_m_ann_wv_context = james_ii_w_m_ann_glove$components
# dim(shakes_wv_context)
# Either word-vectors matrices could work, but the developers of the technique
# suggest the sum/mean may work better
james_ii_w_m_ann_word_vectors = james_ii_w_m_ann_wv_main + t(james_ii_w_m_ann_wv_context)
Write a function as above, this time with two arguments, so we can specify both the word and the relevant reign:
top_ten_function = function(word, period){
if(period == 'james_i'){
vectors = james_i_word_vectors[word, , drop = FALSE]
cos_sim = sim2(x = james_i_word_vectors, y = vectors, method = "cosine", norm = "l2")
}
else if(period == 'charles_i'){ vectors = charles_i_word_vectors[word, , drop = FALSE]
cos_sim = sim2(x = charles_i_word_vectors, y = vectors, method = "cosine", norm = "l2")
}
else if(period == 'commonwealth') {
vectors = commonwealth_word_vectors[word, , drop = FALSE]
cos_sim = sim2(x = commonwealth_word_vectors, y = vectors, method = "cosine", norm = "l2")
}
else if(period == 'charles_ii'){
vectors = charles_ii_word_vectors[word, , drop = FALSE]
cos_sim = sim2(x = charles_ii_word_vectors, y = vectors, method = "cosine", norm = "l2")
}
else {
vectors = james_ii_w_m_ann_word_vectors[word, , drop = FALSE]
cos_sim = sim2(x = james_ii_w_m_ann_word_vectors, y = vectors, method = "cosine", norm = "l2")
}
head(sort(cos_sim[,1], decreasing = TRUE), 20)
}
Write a second function, which takes a word and returns the ten closest words for each reign:
first_in_each= function(word) {
rbind(top_ten_function(word, 'james_i') %>% tibble::enframe() %>% arrange(desc(value)) %>% slice(2:11) %>% mutate(reign ='james_i' ),
top_ten_function(word, 'charles_i') %>% tibble::enframe() %>% arrange(desc(value)) %>% slice(2:11) %>% mutate(reign ='charles_i' ),
top_ten_function(word, 'commonwealth') %>% tibble::enframe() %>% arrange(desc(value)) %>% slice(2:11) %>% mutate(reign ='commonwealth' ),
top_ten_function(word, 'charles_ii') %>% tibble::enframe() %>% arrange(desc(value)) %>% slice(2:11) %>% mutate(reign ='charles_ii' ),
top_ten_function(word, 'james_ii_w_m_ann') %>% tibble::enframe() %>% arrange(desc(value)) %>% slice(2:11) %>% mutate(reign ='james_ii_w_m_ann' ))%>%
group_by(reign) %>%
mutate(rank = rank(value)) %>%
ggplot() +
geom_text(aes(x = factor(reign, levels = c('james_i', 'charles_i', 'commonwealth', 'charles_ii', 'james_ii', 'james_ii_w_m_ann')), y = rank, label = name, color = name)) + theme_void() +
theme(legend.position = 'none',
axis.text.x = element_text(face = 'bold'),
)
}
This can show us the changing associations of particular words over time. Take ‘match’:
first_in_each('match')
In the reign of James I, ‘match’ is semantically linked to words relating to the Spanish Match: a proposed match between Charles I and the Infanta Maria Anna of Spain. During Charles I’s reign and afterwards, the meaning changes completely - now the closest words are all military. In the final section of the data, the semantic link returns again to mostly words about marriage - this time it’s not so obvious why the words are associated, but it’s probably relating to the marriage of Philippe II, Duke of Orléans to Françoise Marie de Bourbon, in 1692 - Philippe II was regent of France until 1723.
The primary purpose of this technique in the ‘real world’ isn’t really to understand the semantic relationship between words for its own sake, but rather is most often used as part of an NLP pipeline, where the embeddings are fed through a neural net to make predictions about text.
However, the word embeddings trained on the text of the Calendars is still a useful way to think about how these texts are constructed and the sort of ‘mental map’ they represent. We’ve seen that it often produces expected results (such as King being closest to Majesty), even in complex tasks: with the analogy pen is to letter as X is to book, X is replaced by ink, printer, pamphlet, and some other relevant book-production words. Certain words can be seen to change over time: match is a good example, which is linked to marriage at some times, and weaponry at others, depending on the time period. Many of these word associations reflect biases in the data, but in certain circumstances this can be a strength rather than a weakness. The danger is not investigating the biases, but rather when we are reductive and try to claim that the word associations seen here are in any way representative of how society at large thought about these concepts more generally. On their own terms, the embeddings can be a powerful historical tool to understand the linked meanings within a discrete set of sources.